Abstract. The purpose of this note is to show how the method of maximum entropy in the mean (MEM) may be used to improve parametric estimation when the measurements are corrupted by a large level of noise. The method is developed in the context of a concrete example: that of estimation of the parameter in an exponential distribution. We compare the performance of our method with the Bayesian and maximum likelihood approaches.

1. Introduction 

Suppose that you want to measure the half-life of a decaying nucleus or the lifetime of some elementary particle, or some other random variable modeled by an exponential distribution describing, say, a decay time or a lifetime of a process. Assume as well that the noise in the measurement process can be modeled by a centered Gaussian random variable whose variance may be of the same order of magnitude as that of the decay rate to be measured. To make things worse, assume that you can only collect very few measurements.

That is, if $x_i$ denotes the realized value of the variable, one can only measure $y_i = x_i + e_i$, for $i = 1, 2, \ldots, n$, where $n$ is a small number, say 2 or 3. In other words, assume that you know that the sample comes from a specific parametric distribution but is contaminated by additive noise. What to do? One possible approach is to apply small sample statistical estimation procedures. But these are designed for problems where the variability is due only to the random nature of the quantity measured, and there is no other noise in the measurement.

Still another possibility, the one we want to explore here, is to apply a maxentropic filtering method to estimate both the unknown variable and the noise level. For this we recast the problem as a typical inverse problem consisting of solving for $x$ in

(1)    $y = Ax + e, \quad x \in K,$

where $K$ is a convex set in $\mathbb{R}^d$, $y \in \mathbb{R}^k$ for some $d$ and $k$, and $A$ is a $k \times d$ matrix which depends on how we rephrase our problem. We could, for example, consider the following problem: Find $x \in [0, \infty)$ such that

(2)    $y = x + e.$

In our case $K = [0, \infty)$, and we set $y = \frac{1}{n}\sum_j y_j$. Or we could consider a collection of $n$ such problems, one for every measurement, and then proceed to carry out the estimation. Once we have solved the generic problem (1), the variations on the theme are easy to write down. What is important to keep in mind here is that the output of the method is a filtered estimator $x^*$ of $x$, which itself is an estimator of the unknown parameter. The novelty then is to filter out the noise in (2).

The method of maximum entropy in the mean is rather well suited for solving problems like (1). See Navaza's [N] for an early development and Dacunha-Castelle and Gamboa [D-G] for a full mathematical treatment. Below we shall briefly review what the method is about and then apply it to obtain an estimator $x^*$ from (2). In Section 3 we obtain the maxentropic estimator and in Section 4 we examine some of its properties; in particular, we examine what the results would be if either the noise level were small or the number of measurements were large. We devote Section 5 to some simulations in which the method is compared with the Bayesian and maximum likelihood approaches.

2. The basics of MEM

MEM is a technique for transforming a possibly ill-posed linear problem with convex constraints into a simpler (possibly unconstrained) but non-linear minimization problem, the number of variables in the auxiliary problem being equal to the number of equations in the original problem, $k$ in the case of problem (1). To carry out the transformation one thinks of the $x$ there as the expected value of a random variable $X$ with respect to some measure $P$ to be determined. The basic datum is a sample space $(\Omega_s, \mathcal{F}_s)$ on which $X$ is to be defined. In our setup the natural choice is to take $\Omega_s = K$, $\mathcal{F}_s = \mathcal{B}(K)$, the Borel subsets of $K$, and $X = \mathrm{id}_K$, the identity map. Similarly, we think of $e$ as the expected value of a random variable $V$ taking values in $\mathbb{R}^k$. The natural choice of sample space here is $\Omega_n = \mathbb{R}^k$ and $\mathcal{F}_n = \mathcal{B}(\mathbb{R}^k)$, the Borel subsets.

To continue we need to select two prior measures $dQ_s(\xi)$ and $dQ_n(v)$ on $(\Omega_s, \mathcal{F}_s)$ and $(\Omega_n, \mathcal{F}_n)$. The only restriction that we impose on them is that the closure of the convex hull of $\mathrm{supp}(Q_s)$ (resp. of $\mathrm{supp}(Q_n)$) is $K$ (resp. $\mathbb{R}^k$). These prior measures embody knowledge that we may have about $x$ and $e$ but are not priors in the Bayesian sense. The two pieces are put together setting $\Omega = \Omega_s \times \Omega_n$, $\mathcal{F} = \mathcal{F}_s \otimes \mathcal{F}_n$, and $dQ(\xi, v) = dQ_s(\xi)\,dQ_n(v)$. And to get going we define the class

(3)    $\mathcal{P} = \{P \mid P \ll Q; \; A E_P[X] + E_P[V] = y\}.$

Note that for any $P \in \mathcal{P}$ having a strictly positive density $\rho = dP/dQ$, we have $E_P[X] \in \mathrm{int}(K)$. For this standard result in analysis see Rudin's book [R]. The procedure to explicitly produce such $P$'s is known as the maximum entropy method. Its first step is to assume that $\mathcal{P} \neq \emptyset$, which amounts to saying that our inverse problem (1) has a solution, and to define

$S_Q : \mathcal{P} \to [-\infty, \infty)$

by the rule

(4)    $S_Q(P) = -\int_\Omega \ln\left(\frac{dP}{dQ}\right) dP$

whenever the function $\ln(dP/dQ)$ is $P$-integrable, and $S_Q(P) = -\infty$ otherwise. This entropy functional is concave on the convex set $\mathcal{P}$. To guess the form of the density of the measure $P^*$ that maximizes $S_Q$, one considers the class of exponential measures on $\Omega$ defined by

(5)    $dP_\lambda = \frac{e^{-\langle\lambda, A\xi\rangle - \langle\lambda, v\rangle}}{Z(\lambda)}\,dQ$

where the normalization factor is

$Z(\lambda) = E_Q\left[e^{-\langle\lambda, AX\rangle - \langle\lambda, V\rangle}\right].$

Here $\lambda \in \mathbb{R}^k$. We define the dual entropy function

$\Sigma(\lambda) : \mathcal{D}(Q) \to (-\infty, \infty]$

by the rule

(6)    $\Sigma(\lambda) = \ln Z(\lambda) + \langle\lambda, y\rangle$

or $\Sigma(\lambda) = \infty$ whenever $\lambda \notin \mathcal{D}(Q) = \{\mu \in \mathbb{R}^k \mid Z(\mu) < \infty\}$.

It is easy to prove that $\Sigma(\lambda) \ge S_Q(P)$ for any $\lambda \in \mathcal{D}(Q)$ and any $P \in \mathcal{P}$. Thus if we were able to find a $\lambda^* \in \mathcal{D}(Q)$ such that $P_{\lambda^*} \in \mathcal{P}$, we would be done. To find such a $\lambda^*$ it suffices to minimize (the convex function) $\Sigma(\lambda)$ over (the convex set) $\mathcal{D}(Q)$. We leave it to the reader to verify that if the minimum is reached in the interior of $\mathcal{D}(Q)$, then $P_{\lambda^*} \in \mathcal{P}$. We direct the reader to [B-R] for all about this, and much more.
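The inequality between the dual entropy and the entropy can be checked in a few lines. What follows is only a sketch of one standard argument, under the assumptions that $P \ll P_\lambda$ (which holds because $P_\lambda$ and $Q$ are mutually absolutely continuous) and that all the integrals involved are finite; it uses the non-negativity of the relative entropy $\int \ln(dP/dP_\lambda)\,dP$ and the constraint in (3).

```latex
\begin{align*}
S_Q(P) &= -\int \ln\frac{dP}{dP_\lambda}\,dP \;-\; \int \ln\frac{dP_\lambda}{dQ}\,dP \\
       &\le -\int \ln\frac{dP_\lambda}{dQ}\,dP \\
       &= \int \big(\langle\lambda, A\xi\rangle + \langle\lambda, v\rangle\big)\,dP + \ln Z(\lambda) \\
       &= \langle\lambda,\, A E_P[X] + E_P[V]\rangle + \ln Z(\lambda)
        = \langle\lambda, y\rangle + \ln Z(\lambda) = \Sigma(\lambda).
\end{align*}
```

Equality holds exactly when the relative entropy term vanishes, that is when $P = P_\lambda$, which is why a $\lambda^*$ with $P_{\lambda^*} \in \mathcal{P}$ solves the problem.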

3. Entropic Estimators 

Let us now turn our attention to equation (2). Since our estimator is a sample mean of an exponential (of unknown parameter), it is natural, for the method described in Section 2, to assume that the prior $Q_s$ for $X$ is a $\Gamma(n, \alpha/n)$ (with density proportional to $\xi^{n-1}e^{-n\alpha\xi}$, so that its mean is $1/\alpha$), where $\alpha > 0$ is our best (or prior) guess of the unknown parameter. Similarly, we shall choose $Q_n$ to be the distribution of a $N(0, \delta^2/n)$ random variable as prior for the noise component.

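Things are rather easy under these assumptions. To begin with, note that

$Z(\lambda) = \frac{e^{\lambda^2\delta^2/2n}}{\left(\frac{\lambda}{n\alpha} + 1\right)^n}$

and the typical member $dP_\lambda(\xi, v)$ of the exponential family is now

(7)    $dP_\lambda(\xi, v) = \frac{(\lambda + n\alpha)^n\,\xi^{n-1}e^{-(\lambda + n\alpha)\xi}}{\Gamma(n)}\;\frac{e^{-n(v + \lambda\delta^2/n)^2/2\delta^2}}{(2\pi\delta^2/n)^{1/2}}\,d\xi\,dv.$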



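It is also easy to verify that the dual entropy function $\Sigma(\lambda)$ is given by

$\Sigma(\lambda) = \frac{\lambda^2\delta^2}{2n} - n\ln\left(\frac{\lambda}{n\alpha} + 1\right) + \lambda y,$

whose minimum value is reached at $\lambda^*$ satisfying

(8)    $\frac{\lambda^*\delta^2}{n} + y = \frac{1/\alpha}{\frac{\lambda^*}{n\alpha} + 1}$

and, discarding one of the two solutions of this quadratic equation (in Lemma 1 below we shall see why we discard the other solution), we are left with

$\frac{\lambda^*}{n} = \frac{1}{2}\left(-\left(\alpha + \frac{y}{\delta^2}\right) + \left(\left(\alpha - \frac{y}{\delta^2}\right)^2 + \frac{4}{\delta^2}\right)^{1/2}\right)$

from which we obtain that

(9)    $\frac{\lambda^*}{n\alpha} + 1 = \frac{1}{2}\left(\left(1 - \frac{y}{\alpha\delta^2}\right) + \left(\left(1 - \frac{y}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)$

as well as

(10)    $x^* = E_{P_{\lambda^*}}[X] = \frac{1/\alpha}{\frac{\lambda^*}{n\alpha} + 1} = \left[\frac{\alpha}{2}\left(\left(1 - \frac{y}{\alpha\delta^2}\right) + \left(\left(1 - \frac{y}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)\right]^{-1}, \qquad e^* = E_{P_{\lambda^*}}[V] = -\frac{\lambda^*\delta^2}{n}.$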



Comment 1. Clearly, from (8) it follows that $y = x^* + e^*$. Thus it makes sense to think of $x^*$ as the estimator with the noise filtered out, and to think of $e^*$ as the residual noise.
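The closed form (10) is easy to evaluate numerically. The following Python sketch is only an illustration: the function name `mem_estimate` and its argument names are hypothetical, and `delta` is assumed to be the noise standard deviation. It computes $x^*$ and $e^*$ and can be used to confirm that $y = x^* + e^*$.

```python
import math


def mem_estimate(y_bar, alpha, delta):
    """Return (x_star, e_star) from the closed form (10).

    y_bar : sample mean of the noisy measurements
    alpha : prior guess of the exponential parameter (alpha >= 0)
    delta : noise standard deviation (delta > 0), an assumed reading of the text
    """
    shifted = alpha - y_bar / delta**2
    # alpha * (lambda*/(n*alpha) + 1): alpha times the right-hand side of (9).
    alpha_u = 0.5 * (shifted + math.sqrt(shifted**2 + 4.0 / delta**2))
    x_star = 1.0 / alpha_u
    # e* = -lambda* delta^2 / n, using lambda*/n = alpha_u - alpha.
    e_star = -(alpha_u - alpha) * delta**2
    return x_star, e_star


# With alpha = 1/y_bar the noise estimate vanishes and x* equals y_bar.
x_star, e_star = mem_estimate(y_bar=2.0, alpha=0.5, delta=0.5)
# The alpha = 0 case is covered by the same expression.
x_zero, e_zero = mem_estimate(y_bar=1.0, alpha=0.0, delta=0.5)
```

Writing the formula in terms of $\alpha$ times the right-hand side of (9) avoids dividing by $\alpha$, so the same function also covers $\alpha = 0$.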

4. Properties of x* 

Let us now spell out some of the notation underlying the probabilistic model behind (1). We shall assume that the $x_i$ and the $e_i$ in the first section are values of random variables $X_i$ and $\epsilon_i$ defined on a sample space $(W, \mathcal{W})$. For each $\theta > 0$, we assume to be given a probability law $P(\theta)$ on $(W, \mathcal{W})$, with respect to which the sequences $\{X_k \mid k = 1, 2, \ldots\}$ and $\{\epsilon_k \mid k = 1, 2, \ldots\}$ are both i.i.d. and independent of each other, and that with respect to $P(\theta)$, $X_k \sim \exp(\theta)$ and $\epsilon_k \sim N(0, \delta^2)$. That is, we consider the underlying model for the noise as our prior model for it. Minimal consistency is all right. From the above, the following basic results are easy to obtain.
From (9) and (10) it is clear that

Lemma 1. If we take $\alpha = 1/y$, then $\lambda^* = 0$, $x^* = y$ and $e^* = 0$.

Comment 2. Actually it is easy to verify that the solution to $x^*(\alpha) = 1/\alpha$ is $\alpha = 1/y$.

To examine the case in which large data sets are available, let us add a superscript $n$ and write $y(n)$ to emphasize the size of the sample. Applying the LLN to the arithmetic mean of an i.i.d. sequence of random variables having $\exp(\theta)$ as common law, it follows that

Lemma 2. As $n \to \infty$,

(11)    $x^*(\alpha) \to x(\alpha) \equiv \left[\frac{\alpha}{2}\left(\left(1 - \frac{\theta}{\alpha\delta^2}\right) + \left(\left(1 - \frac{\theta}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)\right]^{-1}.$

Proof. Start from (10), invoke the LLN to conclude that $y(n)$ tends to $\theta$ and obtain (11). □

Corollary 1. The true parameter is the solution of $x(\alpha) - 1/\alpha = 0$.

Proof. Just look at the right hand side of (11) to conclude that $x(1/\theta) = \theta$. □

Comment 3. What this asserts is that when the number of measurements is large, to find the right value of the parameter it suffices to solve $x(\alpha) - 1/\alpha = 0$.

And when the noise level goes to zero, we have 

Lemma 3. With the notations introduced above, $x^* \to y$ as $\delta \to 0$.

Proof. When $\delta \to 0$, $dQ_n(v) \to \epsilon_0(dv)$, the Dirac point mass at 0. In this case, we just set $\delta = 0$ in (8) and the conclusion follows. □

When we choose $\alpha = 1/y$, the estimator $x^*$ happens to be unbiased.

Lemma 4. Let $\theta$ denote the true but unknown parameter of the exponential, and let $P_\theta(dy)$ have density

$f_\theta(y) = \int_{-\infty}^{y} \frac{\theta^n (y - s)^{n-1} e^{-\theta(y - s)}}{\Gamma(n)}\,\frac{e^{-s^2/2\delta^2}}{\sqrt{2\pi}\,\delta}\,ds$

for $y > 0$ and $0$ otherwise. With the notations introduced above, we have $E_{P_\theta}[x^*] = 1/\theta$ whenever the prior guess $1/\alpha$ for the maxent is the sample mean $y$.

Proof. It drops out easily from Lemma 1, from (2) and the fact that the density $f_\theta$ of $y$ is a convolution. □

But the right choice of the parameter $\alpha$ is a pending issue. To settle it we consider once more the identity $|y - x^*| = |e^*|$. In our particular case we shall see that $\alpha = 0$ minimizes the right hand side of the previous identity.

Lemma 5. With the same notations as above, $e^*$ happens to be a monotone function of $\alpha$, and $e^*(\alpha = 0) = \frac{1}{2}\left(y - \sqrt{y^2 + 4\delta^2}\right)$ and $e^*(\alpha \to \infty) = y$. In the first case $x^*(\alpha = 0) = \frac{1}{2}\left(y + \sqrt{y^2 + 4\delta^2}\right)$, whereas in the second $x^*(\alpha \to \infty) = 0$.

Proof. Recall from the first lemma that when $\alpha y = 1$, then $e^* = 0$. A simple algebraic manipulation shows that when $\alpha y > 1$ then $e^* > 0$, and that when $\alpha y < 1$ then $e^* < 0$. To compute the limit of $e^*$ as $\alpha \to \infty$, note that for large $\alpha$ we can neglect the term $4/\delta^2$ under the square root sign, and then the result drops out. It is also easy to check the positivity of the derivative of $e^*$ with respect to $\alpha$. Also clearly $e^*(0) < e^*(\infty)$. □

To sum up, with the choice $\alpha = 0$, the entropic estimator and residual error are

(12)    $x^*(0) = \frac{1}{2}\left(y + \sqrt{y^2 + 4\delta^2}\right), \qquad e^*(0) = \frac{1}{2}\left(y - \sqrt{y^2 + 4\delta^2}\right).$

5. Simulation and comparison with the Bayesian and Maximum Likelihood approaches

In this section we do several things. First, we generate histograms that describe the statistical nature of $x^*$ as a function of the parameter $\alpha$. For that we generate a data set of 1000 samples, and for each of them we obtain $x^*$ from (12). Also, for each data point we apply both a Bayesian estimation method and a maximum likelihood method, and plot the resulting histograms.

Figure 1. Histogram for E(x) with the Maximum Entropy Method.

5.1. The maxentropic estimator. Simulate $n = 3$ data points $y_1, y_2, y_3$ in the following way:

• Simulate a value for $x_i$ from an exponential distribution with parameter $\theta$.

• Simulate a value for $e_i$ from a normal distribution $N(0, \delta = 0.5)$.

• Sum $x_i$ with $e_i$ to get $y_i$; if $y_i < 0$, repeat the first two steps until $y_i > 0$.

• Do this for $i = 1, 2, 3$.

• Compute the maximum entropy estimator given by equation (10).
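The simulation steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the helper names are hypothetical, $\theta$ is taken to be the rate of the exponential, $\delta = 0.5$ is read as the noise standard deviation, and the estimator is evaluated in its $\alpha = 0$ form (12), which is one possible choice of $\alpha$ in (10).

```python
import math
import random


def simulate_sample(theta, delta, n, rng):
    """Draw n noisy measurements y_i = x_i + e_i, redrawing until y_i > 0."""
    ys = []
    for _ in range(n):
        while True:
            x = rng.expovariate(theta)   # exponential with rate theta (assumed)
            e = rng.gauss(0.0, delta)    # centered Gaussian noise, std delta (assumed)
            if x + e > 0:
                ys.append(x + e)
                break
    return ys


def mem_alpha_zero(y_bar, delta):
    """Entropic estimator x*(0) from equation (12)."""
    return 0.5 * (y_bar + math.sqrt(y_bar**2 + 4.0 * delta**2))


rng = random.Random(0)  # fixed seed so the run is reproducible
theta, delta, n = 1.0, 0.5, 3
estimates = []
for _ in range(1000):
    ys = simulate_sample(theta, delta, n, rng)
    estimates.append(mem_alpha_zero(sum(ys) / n, delta))
```

The list `estimates` holds one value of $x^*$ per simulated data set and can be passed to any histogram routine.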

5.2. The Bayesian estimator. In this section we derive the algorithm for a Bayesian inference of the model given by $y_i = x + e_i$, for $i = 1, 2, \ldots, n$. The classical likelihood estimator of $x$ is given by $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. As we know that the unknown mean $x$ has an exponential probability distribution with parameter $\theta$ ($x \sim E(\theta)$), the joint density of the $y_i$ and $x$ is proportional to:

$\prod_{i=1}^n \exp\left(-\frac{(y_i - x)^2}{2\delta^2}\right)\,\theta\exp(-\theta x)\,\pi(\theta)$

where $\theta\exp(-\theta x)$ is the density of the unknown mean $x$ and where $\pi(\theta) \propto \theta^{-1}$ is the Jeffreys noninformative prior distribution for the parameter $\theta$ (Berger 1985).

Figure 2. Histogram for E(x) with Bayes Method.

In order to compare the Maximum Entropy Method with the Maximum Likelihood and Bayesian estimation methods, we need to repeat the following steps many times so as to derive a probability distribution of our estimates.

In order to derive the Bayesian estimator, we need to get the posterior probability distribution for $\theta$, which we do with the following Gibbs sampling scheme (Robert et al. 2005):

• Draw $x \sim N\left(\bar{y} - \frac{\theta\delta^2}{n}, \frac{\delta^2}{n}\right) 1_{x > 0}$

• Draw $\theta \sim E(x)$

Repeat this algorithm many times in order to obtain a large sample from the posterior distribution of $\theta$, and hence the posterior distribution of $E(x) = 1/\theta$. For our application, we simulate data with $\theta = 1$, which gives an expected value for $x$ equal to $E(x) = 1$.
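The two Gibbs steps can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, $E(x)$ is read as the exponential distribution with rate $x$, and the truncated normal draw uses plain rejection when the mean is positive and, as a swapped-in technique, a shifted-exponential proposal (Robert's accept-reject scheme for normal tails) when it is negative, so that the loop does not stall when $\theta$ becomes large.

```python
import math
import random


def draw_positive_normal(mean, sd, rng):
    """Draw from N(mean, sd^2) restricted to x > 0."""
    a = -mean / sd                      # standardized truncation point
    if a <= 0:
        while True:                     # acceptance probability is at least 1/2
            x = rng.gauss(mean, sd)
            if x > 0:
                return x
    rate = 0.5 * (a + math.hypot(a, 2.0))   # (a + sqrt(a^2 + 4)) / 2
    while True:
        w = rng.expovariate(rate)       # proposal: z = a + w with w exponential
        if w > 0 and rng.random() <= math.exp(-0.5 * (a + w - rate) ** 2):
            return sd * w               # equals mean + sd * (a + w)


def gibbs_sampler(ys, delta, n_iter, rng, theta0=1.0):
    """Alternate the two conditional draws; return the chains (xs, thetas)."""
    n = len(ys)
    y_bar = sum(ys) / n
    sd = delta / math.sqrt(n)
    theta = theta0                      # arbitrary starting value (assumption)
    xs, thetas = [], []
    for _ in range(n_iter):
        x = draw_positive_normal(y_bar - theta * delta**2 / n, sd, rng)
        theta = rng.expovariate(x)      # exponential with rate x
        xs.append(x)
        thetas.append(theta)
    return xs, thetas


xs, thetas = gibbs_sampler([0.9, 1.4, 0.6], delta=0.5, n_iter=2000,
                           rng=random.Random(0))
```

The draws in `thetas` (after discarding an initial stretch, if desired) give a sample from which the distribution of $1/\theta$ can be summarized.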

We get the following histograms for the estimations of $E(x)$ after 1000 iterations when simulating data for $\theta = 1$.

5.3. The Maximum Likelihood estimator. The problem of obtaining a ML estimator is complicated in this setup because the data points are distributed according to

$f_\theta(t) = \int_{-\infty}^{t} \theta e^{-\theta(t - s)}\,e^{-s^2/2\delta^2}\,\frac{ds}{\sqrt{2\pi\delta^2}} = \theta e^{-\theta t + \theta^2\delta^2/2}\,P(S < t)$

where $S \sim N(\theta\delta^2, \delta^2)$. Therefore, after observing $t_1$, $t_2$, and $t_3$, we get the following likelihood, which we maximize numerically:

(14)    $\theta^n e^{-\theta\sum_{i=1}^n t_i + n\theta^2\delta^2/2}\,\prod_{i=1}^n P(S < t_i).$

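If we attempted to obtain the ML estimator analytically, we would need to solve

$\sum_{j=1}^n \frac{\int_{-\infty}^{t_j} \left(1 - \theta(t_j - s)\right) e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds}{\int_{-\infty}^{t_j} \theta e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds} = 0.$

Notice that as $\delta \to 0$ this equation tends to $\frac{n}{\theta} - \sum_{j=1}^n t_j = 0$, as expected. We can move forward a bit and integrate by parts each numerator, and after some calculations we arrive at

$\frac{n}{\theta} - \sum_{j=1}^n t_j + n\delta^2\theta - \sum_{j=1}^n \frac{\delta^2\theta\,e^{-t_j^2/2\delta^2}}{\int_{-\infty}^{t_j} \theta e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds} = 0.$

Trying to solve this equation in $\theta$ is rather hopeless. That is the reason why we carried out a numerical maximization procedure on (14). To understand what happens when the noise is small, we drop the last term in the last equation and we are left with

$\frac{n}{\theta} - \sum_{j=1}^n t_j + n\delta^2\theta = 0,$

the solution of which is

$\frac{1}{\theta^*} = \frac{1}{2}\left(\bar{y} + \sqrt{\bar{y}^2 - 4\delta^2}\right)$

or $\theta^* = 2\left(\bar{y} + \sqrt{\bar{y}^2 - 4\delta^2}\right)^{-1}$, where $\bar{y}$ is the mean of the observations, and we see that the effect of noise is to increase the ML estimator. In Figure 3 we plot the histogram of $1/\theta^*$ obtained by numerically maximizing (14) for each simulated data point.

Figure 3. Histogram for E(x) with the Maximum Likelihood Method.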
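A numerical maximization of (14) can be sketched as follows. The paper does not say which optimizer was used, so this illustration simply evaluates the log of (14) on a grid; the function names, the search interval $[0.01, 20]$ and the grid size are all assumptions made for the example. The Gaussian probability $P(S < t)$ is computed with the standard library's `math.erfc`.

```python
import math


def log_likelihood(theta, ts, delta):
    """Log of (14): n ln(theta) - theta sum(t) + n theta^2 delta^2 / 2 + sum ln P(S < t_i)."""
    n = len(ts)
    total = n * math.log(theta) - theta * sum(ts) + 0.5 * n * theta**2 * delta**2
    for t in ts:
        z = (t - theta * delta**2) / delta          # S ~ N(theta delta^2, delta^2)
        prob = 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z
        if prob <= 0.0:                              # underflow guard far in the tail
            return float("-inf")
        total += math.log(prob)
    return total


def ml_estimate(ts, delta, lo=0.01, hi=20.0, grid_size=2000):
    """Grid search for the maximizer of the log-likelihood on [lo, hi]."""
    best_theta, best_value = lo, float("-inf")
    for k in range(grid_size):
        theta = lo + (hi - lo) * k / (grid_size - 1)
        value = log_likelihood(theta, ts, delta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta


theta_small_noise = ml_estimate([1.0, 0.8, 1.2], delta=0.01)
theta_noisy = ml_estimate([1.0, 0.8, 1.2], delta=0.5)
```

A bounded grid is used because the likelihood flattens out for large $\theta$, so an unbounded search may not settle; with very small noise the maximizer is close to $n/\sum_j t_j$, in line with the small-noise limit discussed above.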

When simulating data for $\theta = 1$, the MEM, maximum likelihood and Bayesian histograms for $E(x)$ are all skewed to the right, with sample means between 1.0 and 1.9. The MEM method yields a sample mean of 1.3252 with a sample standard deviation of 0.5, the Bayesian method yields a sample mean of 1.045 with a sample standard deviation of 0.5529, and the maximum likelihood method yields a sample mean of 1.81 with a sample standard deviation of 2.29. The MEM and Bayesian methods give similar results, and both are more accurate than the maximum likelihood method.

6. Concluding remarks 

On one hand, MEM backs up the intuitive belief according to which, if the $y_i$ are all the data that you have, it is all right to compute your estimator of the mean for $\alpha = 0$. The MEM and Bayesian methods yield closer results to the true parameter value than the maximum likelihood estimator.

On the other, and this depends on your choice of priors, MEM provides us with a way of modifying those priors and obtaining representations like $y = x^* + e^*$, where of course $x^* = x^*(y)$. What we saw above is that there is a choice of prior distributions such that $x^* = y$ and $e^* = 0$.

The important thing is that this is actually true regardless of what the "true" probability describing the $x_i$ is.
Abstract. The purpose of this note is to show how the method of maximum entropy in the mean (MEM) may be used to improve parametric estimation when the measurements are corrupted by a large level of noise. The method is developed in the context of a concrete example: that of estimation of the parameter in an exponential distribution. We compare the performance of our method with the Bayesian and maximum likelihood approaches.

1. Introduction 

Suppose that you want to measure the half-life of a decaying nucleus or the lifetime of some elementary particle, or some other random variable modeled by an exponential distribution describing, say, a decay time or the lifetime of a process. Assume as well that the noise in the measurement process can be modeled by a centered Gaussian random variable whose variance may be of the same order of magnitude as that of the decay rate to be measured. To make things worse, assume that you can only collect very few measurements.

That is, if $x_i$ denotes the realized value of the variable, one can only measure $y_i = x_i + e_i$, for $i = 1, 2, \ldots, n$, where $n$ is a small number, say 2 or 3, and $e_i$ denotes the additive measurement noise. In other words, assume that you know that the sample comes from a specific parametric distribution but is contaminated by additive noise. What to do? One possible approach is to apply small sample statistical estimation procedures. But these are designed for problems where the variability is due only to the random nature of the quantity measured, and there is no other noise in the measurement.

Still another possibility, the one that we explore here, is to apply a maxentropic filtering method to estimate both the unknown variable and the noise level. For this we recast the problem as a typical inverse problem consisting of solving for $x$ in

(1)    $y = Ax + e, \quad x \in K,$

where $K$ is a convex set in $\mathbb{R}^d$, $y \in \mathbb{R}^k$ for some $d$ and $k$, and $A$ is a $k \times d$ matrix which depends on how we rephrase our problem. We could, for example, consider the following problem: Find $x \in [0, \infty)$ such that

(2)    $y = x + e.$

In our case $K = [0, \infty)$, and we set $y = \frac{1}{n}\sum_j y_j$. Or we could consider a collection of $n$ such problems, one for every measurement, and then proceed to carry out the estimation. Once we have solved the generic problem (1), the variations on the theme are easy to write down. What is important to keep in mind here is that the output of the method is a filtered estimator $x^*$ of $x$, which itself is an estimator of the unknown parameter. The novelty then is to filter out the noise in (2).

The method of maximum entropy in the mean is rather well suited for solving problems like (1). See Navaza (1986) for an early development and Dacunha-Castelle and Gamboa (1990) for a full mathematical treatment. Below we shall briefly review what the method is about and then apply it to obtain an estimator $x^*$ from (2). In Section 3 we obtain the maxentropic estimator and in Section 4 we examine some of its properties; in particular, we examine what the results would be if either the noise level were small or the number of measurements were large. We devote Section 5 to some simulations in which the method is compared with the Bayesian and maximum likelihood approaches.

2. The basics of MEM

MEM is a technique for transforming a possibly ill-posed linear problem with convex constraints into a simpler (usually unconstrained) but non-linear minimization problem, the number of variables in the auxiliary problem being equal to the number of equations in the original problem, $k$ in the case of problem (1). To carry out the transformation one thinks of the $x$ there as the expected value of a random variable $X$ with respect to some measure $P$ to be determined. The basic datum is a sample space $(\Omega_s, \mathcal{F}_s)$ on which $X$ is to be defined. In our setup the natural choice is to take $\Omega_s = K$, $\mathcal{F}_s = \mathcal{B}(K)$, the Borel subsets of $K$, and $X = \mathrm{id}_K$, the identity map. Similarly, we think of $e$ as the expected value of a random variable $V$ taking values in $\mathbb{R}^k$. The natural choice of sample space here is $\Omega_n = \mathbb{R}^k$ and $\mathcal{F}_n = \mathcal{B}(\mathbb{R}^k)$, the Borel subsets.

To continue we need to select two prior measures $dQ_s(\xi)$ and $dQ_n(v)$ on $(\Omega_s, \mathcal{F}_s)$ and $(\Omega_n, \mathcal{F}_n)$. The only restriction that we impose on them is that the closure of the convex hull of $\mathrm{supp}(Q_s)$ (resp. of $\mathrm{supp}(Q_n)$) is $K$ (resp. $\mathbb{R}^k$). These prior measures embody knowledge that we may have about $x$ and $e$ but are not priors in the Bayesian sense. Actually, the model for the noise component describes the characteristics of the measurement device or process, and it is a datum. The two pieces are put together setting $\Omega = \Omega_s \times \Omega_n$, $\mathcal{F} = \mathcal{F}_s \otimes \mathcal{F}_n$, and $dQ(\xi, v) = dQ_s(\xi)\,dQ_n(v)$. And to get going we define the class

(3)    $\mathcal{P} = \{P \mid P \ll Q; \; A E_P[X] + E_P[V] = y\}.$

Note that for any $P \in \mathcal{P}$ having a strictly positive density $\rho = dP/dQ$, we have $E_P[X] \in \mathrm{int}(K)$. For this standard result in analysis see Rudin's (1973) book. The procedure to explicitly produce such $P$'s is known as the maximum entropy method. Its first step is to assume that $\mathcal{P} \neq \emptyset$, which amounts to saying that our inverse problem (1) has a solution, and to define

$S_Q : \mathcal{P} \to [-\infty, \infty)$

by the rule

(4)    $S_Q(P) = -\int_\Omega \ln\left(\frac{dP}{dQ}\right) dP$

whenever the function $\ln(dP/dQ)$ is $P$-integrable, and $S_Q(P) = -\infty$ otherwise. This entropy functional is concave on the convex set $\mathcal{P}$. To guess the form of the density of the measure $P^*$ that maximizes $S_Q$, one considers the class of exponential measures on $\Omega$ defined by

(5)    $dP_\lambda = \frac{e^{-\langle\lambda, A\xi\rangle - \langle\lambda, v\rangle}}{Z(\lambda)}\,dQ$

where the normalization factor is

$Z(\lambda) = E_Q\left[e^{-\langle\lambda, AX\rangle - \langle\lambda, V\rangle}\right].$

Here $\lambda \in \mathbb{R}^k$. We define the dual entropy function

$\Sigma(\lambda) : \mathcal{D}(Q) \to (-\infty, \infty]$

by the rule

(6)    $\Sigma(\lambda) = \ln Z(\lambda) + \langle\lambda, y\rangle$

or $\Sigma(\lambda) = \infty$ whenever $\lambda \notin \mathcal{D}(Q) = \{\mu \in \mathbb{R}^k \mid Z(\mu) < \infty\}$.

It is easy to prove that $\Sigma(\lambda) \ge S_Q(P)$ for any $\lambda \in \mathcal{D}(Q)$ and any $P \in \mathcal{P}$. Thus if we were able to find a $\lambda^* \in \mathcal{D}(Q)$ such that $P_{\lambda^*} \in \mathcal{P}$, we would be done. To find such a $\lambda^*$ it suffices to minimize (the convex function) $\Sigma(\lambda)$ over (the convex set) $\mathcal{D}(Q)$. We leave it to the reader to verify that if the minimum is reached in the interior of $\mathcal{D}(Q)$, then $P_{\lambda^*} \in \mathcal{P}$. We direct the reader to Borwein and Lewis (2000) for all about this, and much more.

3. Entropic Estimators 

Let us now turn our attention to equation (2). Since our estimator is a sample mean of an exponential (of unknown parameter), it is natural, for the method described in Section 2, to assume that the prior $Q_s$ for $X$ is a $\Gamma(n, \alpha/n)$ (with density proportional to $\xi^{n-1}e^{-n\alpha\xi}$, so that its mean is $1/\alpha$), where $\alpha > 0$ is our best (or prior) guess of the unknown parameter. Below we propose a criterion for the best choice of $\alpha$. Similarly, we shall choose $Q_n$ to be the distribution of a $N(0, \delta^2/n)$ random variable as prior for the noise component.

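Things are rather easy under these assumptions. To begin with, note that

$Z(\lambda) = \frac{e^{\lambda^2\delta^2/2n}}{\left(\frac{\lambda}{n\alpha} + 1\right)^n}$

and the typical member $dP_\lambda(\xi, v)$ of the exponential family is now

(7)    $dP_\lambda(\xi, v) = \frac{(\lambda + n\alpha)^n\,\xi^{n-1}e^{-(\lambda + n\alpha)\xi}}{\Gamma(n)}\;\frac{e^{-n(v + \lambda\delta^2/n)^2/2\delta^2}}{(2\pi\delta^2/n)^{1/2}}\,d\xi\,dv.$

It is also easy to verify that the dual entropy function $\Sigma(\lambda)$ is given by

$\Sigma(\lambda) = \frac{\lambda^2\delta^2}{2n} - n\ln\left(\frac{\lambda}{n\alpha} + 1\right) + \lambda y,$

whose minimum value is reached at $\lambda^*$ satisfying

(8)    $\frac{\lambda^*\delta^2}{n} + y = \frac{1/\alpha}{\frac{\lambda^*}{n\alpha} + 1}$

and, discarding one of the solutions (because it leads to a negative estimator of a positive quantity), we are left with

$\frac{\lambda^*}{n} = \frac{1}{2}\left(-\left(\alpha + \frac{y}{\delta^2}\right) + \left(\left(\alpha - \frac{y}{\delta^2}\right)^2 + \frac{4}{\delta^2}\right)^{1/2}\right)$

from which we obtain that

(9)    $\frac{\lambda^*}{n\alpha} + 1 = \frac{1}{2}\left(\left(1 - \frac{y}{\alpha\delta^2}\right) + \left(\left(1 - \frac{y}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)$

as well as

(10)    $x^* = E_{P_{\lambda^*}}[X] = \frac{1/\alpha}{\frac{\lambda^*}{n\alpha} + 1} = \left[\frac{\alpha}{2}\left(\left(1 - \frac{y}{\alpha\delta^2}\right) + \left(\left(1 - \frac{y}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)\right]^{-1}, \qquad e^* = E_{P_{\lambda^*}}[V] = -\frac{\lambda^*\delta^2}{n}.$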

Comment 1. Clearly, from (8) it follows that $y = x^* + e^*$. Thus it makes sense to think of $x^*$ as the estimator with the noise filtered out, and to think of $e^*$ as the residual noise.
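As a cross-check of the closed form, the dual entropy function of this section can also be minimized numerically. The sketch below is only illustrative and its names are hypothetical: it assumes $\alpha > 0$ and that `delta` is the noise standard deviation, and it uses bisection on the derivative $\Sigma'(\lambda) = \lambda\delta^2/n + y - n/(n\alpha + \lambda)$, which is increasing on $(-n\alpha, \infty)$ because $\Sigma$ is convex there.

```python
import math


def dual_gradient(lam, y_bar, alpha, delta, n):
    """Derivative of the dual entropy with respect to lambda."""
    return lam * delta**2 / n + y_bar - n / (n * alpha + lam)


def minimize_dual(y_bar, alpha, delta, n, iterations=200):
    """Find lambda* by bisection on the sign of the derivative (alpha > 0)."""
    lo = -n * alpha * (1.0 - 1e-12)     # just inside the domain; gradient is negative
    hi = 1.0
    while dual_gradient(hi, y_bar, alpha, delta, n) < 0:
        hi *= 2.0                       # expand until the gradient changes sign
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if dual_gradient(mid, y_bar, alpha, delta, n) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


y_bar, alpha, delta, n = 1.3, 0.7, 0.5, 3
lam_star = minimize_dual(y_bar, alpha, delta, n)
x_numeric = n / (n * alpha + lam_star)      # mean of X under P_{lambda*}
e_numeric = -lam_star * delta**2 / n        # mean of V under P_{lambda*}

# Closed form (10), written without dividing by alpha.
shifted = alpha - y_bar / delta**2
x_closed = 1.0 / (0.5 * (shifted + math.sqrt(shifted**2 + 4.0 / delta**2)))
```

In this run `x_numeric` and `x_closed` agree to numerical precision, and `x_numeric + e_numeric` reproduces `y_bar`, as Comment 1 states.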

4. Properties of x* 

Let us now spell out some of the notation underlying the probabilistic model behind (1). We shall assume that the $x_i$ and the $e_i$ in the first section are values of random variables $X_i$ and $\epsilon_i$ defined on a sample space $(W, \mathcal{W})$. For each $\theta > 0$, we assume to be given a probability law $P(\theta)$ on $(W, \mathcal{W})$, with respect to which the sequences $\{X_k \mid k = 1, 2, \ldots\}$ and $\{\epsilon_k \mid k = 1, 2, \ldots\}$ are both i.i.d. and independent of each other, and that with respect to $P(\theta)$, $X_k \sim \exp(\theta)$ and $\epsilon_k \sim N(0, \delta^2)$. That is, we consider the underlying model for the noise as our prior model for it. Minimal consistency is all right. From the above, the following basic results are easy to obtain.
From (9) and (10) it is clear that

Lemma 1. If we take $\alpha = 1/y$, then $\lambda^* = 0$, $x^* = y$ and $e^* = 0$.

Comment 2. Actually it is easy to verify that the solution to $x^*(\alpha) = 1/\alpha$ is $\alpha = 1/y$.

To examine the case in which large data sets are available, let us add a superscript $n$ and write $y(n)$ to emphasize the size of the sample. Applying the LLN to the arithmetic mean of an i.i.d. sequence of random variables having $\exp(\theta)$ as common law, it follows that

Lemma 2. As $n \to \infty$,

(11)    $x^*(\alpha) \to x(\alpha) \equiv \left[\frac{\alpha}{2}\left(\left(1 - \frac{\theta}{\alpha\delta^2}\right) + \left(\left(1 - \frac{\theta}{\alpha\delta^2}\right)^2 + \frac{4}{\alpha^2\delta^2}\right)^{1/2}\right)\right]^{-1}.$

Proof. Start from (10), invoke the LLN to conclude that $y(n)$ tends to $\theta$ and obtain (11). □

Corollary 1. The true parameter is the solution of $x(\alpha) - 1/\alpha = 0$.

Proof. Just look at the right hand side of (11) to conclude that $x(1/\theta) = \theta$. □

Comment 3. What this asserts is that when the number of measurements is large, to find the right value of the parameter it suffices to solve $x(\alpha) - 1/\alpha = 0$.

And when the noise level goes to zero, we have 

Lemma 3. With the notations introduced above, $x^* \to y$ as $\delta \to 0$.

Proof. When $\delta \to 0$, $dQ_n(v) \to \epsilon_0(dv)$, the Dirac point mass at 0. In this case, we just set $\delta = 0$ in (8) and the conclusion follows. □

When we choose $\alpha = 1/y$, the estimator $x^*$ happens to be unbiased.

Lemma 4. Let $\theta$ denote the true but unknown parameter of the exponential, and let $P_\theta(dy)$ have density

$f_\theta(y) = \int_{-\infty}^{y} \frac{\theta^n (y - s)^{n-1} e^{-\theta(y - s)}}{\Gamma(n)}\,\frac{e^{-s^2/2\delta^2}}{\sqrt{2\pi}\,\delta}\,ds$

for $y > 0$ and $0$ otherwise. With the notations introduced above, we have $E_{P_\theta}[x^*] = 1/\theta$ whenever the prior guess $1/\alpha$ for the maxent is the sample mean $y$.

Proof. It drops out easily from Lemma 1, from (2) and the fact that the density $f_\theta$ of $y$ is a convolution. □

But the right choice of the parameter $\alpha$ is still a pending issue. To settle it we consider once more the identity $|y - x^*| = |e^*|$. In our particular case we shall see that $\alpha = 0$ minimizes the right hand side of the previous identity. Thus, we propose to choose $\alpha$ to minimize the residual or reconstruction error.

Lemma 5. With the same notations as above, $e^*$ happens to be a monotone function of $\alpha$, and $e^*(\alpha = 0) = \frac{1}{2}\left(y - \sqrt{y^2 + 4\delta^2}\right)$ and $e^*(\alpha \to \infty) = y$. In the first case $x^*(\alpha = 0) = \frac{1}{2}\left(y + \sqrt{y^2 + 4\delta^2}\right)$, whereas in the second $x^*(\alpha \to \infty) = 0$.

Proof. Recall from the first lemma that when $\alpha y = 1$, then $e^* = 0$. A simple algebraic manipulation shows that when $\alpha y > 1$ then $e^* > 0$, and that when $\alpha y < 1$ then $e^* < 0$. To compute the limit of $e^*$ as $\alpha \to \infty$, note that for large $\alpha$ we can neglect the term $4/\delta^2$ under the square root sign, and then the result drops out. It is also easy to check the positivity of the derivative of $e^*$ with respect to $\alpha$. Also clearly $|e^*(0)| < |e^*(\infty)|$. □

To sum up, with the choice $\alpha = 0$, the entropic estimator and residual error are

(12)    $x^*(0) = \frac{1}{2}\left(y + \sqrt{y^2 + 4\delta^2}\right), \qquad e^*(0) = \frac{1}{2}\left(y - \sqrt{y^2 + 4\delta^2}\right).$
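For completeness, here is a short worked derivation of the $\alpha = 0$ values from (10). It is a sketch that only uses (10), the identity $y = x^* + e^*$, and the algebraic step of multiplying numerator and denominator by $\sqrt{y^2 + 4\delta^2} + y$.

```latex
\begin{align*}
\frac{1}{x^*(\alpha)} &= \frac{1}{2}\left(\Big(\alpha - \frac{y}{\delta^2}\Big)
   + \Big(\Big(\alpha - \frac{y}{\delta^2}\Big)^2 + \frac{4}{\delta^2}\Big)^{1/2}\right), \\
\frac{1}{x^*(0)} &= \frac{1}{2\delta^2}\left(\sqrt{y^2 + 4\delta^2} - y\right), \\
x^*(0) &= \frac{2\delta^2}{\sqrt{y^2 + 4\delta^2} - y}
        = \frac{2\delta^2\left(\sqrt{y^2 + 4\delta^2} + y\right)}{4\delta^2}
        = \frac{1}{2}\left(y + \sqrt{y^2 + 4\delta^2}\right), \\
e^*(0) &= y - x^*(0) = \frac{1}{2}\left(y - \sqrt{y^2 + 4\delta^2}\right).
\end{align*}
```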

5. Simulation and comparison with the Bayesian and Maximum Likelihood approaches

In this section we compare the proposed maximum entropy in the mean procedure with the Bayesian and maximum likelihood estimation procedures. We do that by simulating data, carrying out the three procedures and plotting the corresponding histograms. First, we generate histograms that describe the statistical nature of $x^*$ as a function of the parameter $\alpha$. For that we generate a data set of 1000 samples, and for each of them we obtain $x^*$ from (12). Also, for each data point we apply both a Bayesian estimation method and a maximum likelihood method, and plot the resulting histograms.

Figure 1. Histogram for E(x) with the Maximum Entropy Method.

5.1. The maxentropic estimator. The simulated data process goes as follows. For $n = 3$ the data points $y_1, y_2, y_3$ are obtained in the following way:

• Simulate a value for $x_i$ from an exponential distribution with parameter $\theta$ $(= 1)$.

• Simulate a value for $e_i$ from a normal distribution $N(0, \delta = 0.5)$.

• Sum $x_i$ with $e_i$ to get $y_i$; if $y_i < 0$, repeat the first two steps until $y_i > 0$.

• Do this for $i = 1, 2, 3$.

• Compute the maximum entropy estimator given by equation (10).

We then display the resulting histogram in Figure 1.

5.2. The Bayesian estimator. In this section we derive the algorithm for a Bayesian inference of the model given by $y_i = x + e_i$, for $i = 1, 2, \ldots, n$. The classical likelihood estimator of $x$ is given by $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. As we know that the unknown mean $x$ has an exponential probability distribution with parameter $\theta$ ($x \sim E(\theta)$), the joint density of the $y_i$ and $x$ is proportional to:

$\prod_{i=1}^n \exp\left(-\frac{(y_i - x)^2}{2\delta^2}\right)\,\theta\exp(-\theta x)\,\pi(\theta)$

where $\theta\exp(-\theta x)$ is the density of the unknown mean $x$ and where $\pi(\theta) \propto \theta^{-1}$ is the Jeffreys noninformative prior distribution for the parameter $\theta$, Berger (1985).

In order to derive the Bayesian estimator, we need to get the posterior probability distribution for $\theta$, which we do with the following Gibbs sampling scheme, described in Robert and Casella (2005):

• Draw $x \sim N\left(\bar{y} - \frac{\theta\delta^2}{n}, \frac{\delta^2}{n}\right) 1_{x > 0}$

• Draw $\theta \sim E(x)$

Repeat this algorithm many times in order to obtain a large sample from the posterior distribution of $\theta$, and hence the posterior distribution of $E(x) = 1/\theta$. For our application, we simulate data with $\theta = 1$, which gives an expected value for $x$ equal to $E(x) = 1$.

We get the histogram displayed in Figure 2 for the estimations of $E(x)$ after 1000 iterations when simulating data for $\theta = 1$.

Figure 2. Histogram for E(x) with Bayes Method.
5.3. The Maximum Likelihood estimator. The problem of obtaining a ML estimator is complicated in this setup because the data points are distributed according to

$f_\theta(t) = \int_{-\infty}^{t} \theta e^{-\theta(t - s)}\,e^{-s^2/2\delta^2}\,\frac{ds}{\sqrt{2\pi\delta^2}} = \theta e^{-\theta t + \theta^2\delta^2/2}\,P(S < t)$

where $S \sim N(\theta\delta^2, \delta^2)$. Therefore, after observing $t_1$, $t_2$, and $t_3$, we get the following likelihood, which we maximize numerically:

(14)    $\theta^n e^{-\theta\sum_{i=1}^n t_i + n\theta^2\delta^2/2}\,\prod_{i=1}^n P(S < t_i).$

If we attempted to obtain the ML estimator analytically, we would need to solve

$\sum_{j=1}^n \frac{\int_{-\infty}^{t_j} \left(1 - \theta(t_j - s)\right) e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds}{\int_{-\infty}^{t_j} \theta e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds} = 0.$

Notice that as $\delta \to 0$ this equation tends to $\frac{n}{\theta} - \sum_{j=1}^n t_j = 0$, as expected. We can move forward a bit and integrate by parts each numerator, and after some calculations we arrive at

$\frac{n}{\theta} - \sum_{j=1}^n t_j + n\delta^2\theta - \sum_{j=1}^n \frac{\delta^2\theta\,e^{-t_j^2/2\delta^2}}{\int_{-\infty}^{t_j} \theta e^{-\theta(t_j - s)}\,e^{-s^2/2\delta^2}\,ds} = 0.$

Trying to solve this equation in $\theta$ is rather hopeless. That is the reason why we carried out a numerical maximization procedure on (14). To understand what happens when the noise is small, we drop the last term in the last equation and we are left with

$\frac{n}{\theta} - \sum_{j=1}^n t_j + n\delta^2\theta = 0,$

the solution of which is

$\frac{1}{\theta^*} = \frac{1}{2}\left(\bar{y} + \sqrt{\bar{y}^2 - 4\delta^2}\right)$

or $\theta^* = 2\left(\bar{y} + \sqrt{\bar{y}^2 - 4\delta^2}\right)^{-1}$, where $\bar{y}$ is the mean of the observations, and we see that the effect of noise is to increase the ML estimator. In Figure 3 we plot the histogram of $1/\theta^*$ obtained by numerically maximizing (14) for each simulated data point.

Figure 3. Histogram for E(x) with the Maximum Likelihood Method.

When simulating data for $\theta = 1$, the MEM, maximum likelihood and Bayesian histograms for $E(x)$ are all skewed to the right, with sample means between 1.0 and 1.9. The MEM method yields a sample mean of 1.3252 with a sample standard deviation of 0.5, the Bayesian method yields a sample mean of 1.045 with a sample standard deviation of 0.5529, and the maximum likelihood method yields a sample mean of 1.81 with a sample standard deviation of 2.29. The MEM and Bayesian methods give similar results, and both are more accurate than the maximum likelihood method.

6. Concluding remarks 

On one hand, MEM backs up the intuitive belief according to which, if the $y_i$ are all the data that you have, it is all right to compute your estimator of the mean for $\alpha = 0$. The MEM and Bayesian methods yield closer results to the true parameter value than the maximum likelihood estimator.

On the other, and this depends on your choice of priors, MEM provides us with a way of modifying those priors and obtaining representations like $y = x^* + e^*$, where of course $x^* = x^*(y)$. What we saw above is that there is a choice of prior distributions such that $x^* = y$ and $e^* = 0$.

The important thing is that this is actually true regardless of what the "true" probability describing the $x_i$ is.